KnCr: A Short-Text Narrow-Domain Sub-Corpus of Medline
Abstract
Clustering short texts in narrow domains is among the most difficult clustering tasks, owing to the high vocabulary overlap among the texts and the specialized terminology used by researchers. Here we present a new corpus of scientific texts in the medical domain, specifically on cancer topics. The corpus is a subset of the latest MEDLINE sample and comprises 900 abstracts in 16 categories. It is provided as a dataset for evaluating algorithms in this area. Preliminary experiments with the corpus highlight its difficulty and reinforce the hypothesis of using it in this challenging...
Similar resources
A Self-enriching Methodology for Clustering Narrow Domain Short Texts
s of Scientific Texts Using the Transition Point Technique. Proc. CICLing Conference, CICLing’06, Mexico City, Mexico, February 19–25, Lecture Notes in Computer Science 3878, pp. 536–546. Springer, Berlin. [24] Alexandrov, M., Gelbukh, A. and Rosso, P. (2005) An Approach to Clustering Abstracts. Proc. 10th Int. Conf. Application of Natural Language to Information Systems, NLDB’05, Alicante, S...
Density-based clustering of short-text corpora. Agrupamiento de textos cortos basado en densidad
In this work, we analyse the performance of different density-based algorithms on short-text and narrow-domain short-text corpora. We attempt to determine to what extent the features of this kind of corpora affect the density computation of the clusterings obtained, and how robust these algorithms are to the different complexity levels.
BioDCA Identifier: A System for Automatic Identification of Discourse Connective and Arguments from Biomedical Text
This paper describes a natural language processing system developed for automatic identification of explicit connectives, their senses, and their arguments. Prior work has shown that differences in the usage of connectives across corpora negatively affect the cross-domain connective identification task. Hence the development of a domain-specific discourse parser has become indispensable. Here, we present a...
Developing a Corpus-Based Word List in Pharmacy Research Articles: A Focus on Academic Culture
The present corpus-based lexical study reports the development of a Pharmacy Academic Word List (PAWL), a list of the most frequent words from a corpus of 3,458,445 tokens made up of the 800 most recent pharmacy texts, including research articles, review articles, and short communications in four sub-disciplines of pharmacy. WordSmith (Scott, 2017) and AntWordProfiler (Anthony, 2014) were used to sc...
Question answering in biomedicine
Recent developments in question answering have stayed with open-domain questions and collections, sometimes argued to be more difficult than narrow, domain-focused questions and corpora. The biomedical field is indeed a specialized domain; however, its scope is fairly broad, so that considering a biomedical QA task is not necessarily such a simplification over open-domain QA as represented ...
Publication date: 2006